
[코담] Courses and dev notes covering everything Python and Django: web development, real-world projects, and AI

Pandas Part 4: Scikit-learn Survivor Prediction Model | ✅ Edited by: 코담 admin

📖 Getting Started with Data Analysis, Mastering Pandas - Part 4: Building a Survivor Prediction Model with Scikit-learn

✨ Introduction: From Data Analysis to Machine Learning

In Parts 1 to 3, we used Pandas to load data, preprocess it, and perform EDA. Now we use Scikit-learn to build a machine learning model that predicts whether each Titanic passenger survived.

💡 Practical tip: The patterns and derived variables you find during EDA make a big difference at the modeling stage.

🛠️ Step 1: Import the Required Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

📦 Step 2: Load and Preprocess the Data

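# Load the Titanic data
df = pd.read_csv('titanic.csv')

# Handle missing values: median for Age, most frequent value for Embarked.
# Assigning the result back avoids the chained inplace=True pattern,
# which is deprecated in recent pandas versions.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode categorical columns as integers
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['Embarked'] = df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Define the features and the target
features = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch', 'Embarked']
X = df[features]
y = df['Survived']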
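
Mapping Embarked to 0/1/2 gives the ports a numeric order they do not actually have. Tree-based models such as Random Forest usually cope with this, but linear models may not. As an alternative, here is a minimal sketch of one-hot encoding with `pd.get_dummies`. The `toy` frame and its values are made up for illustration and are not the real Titanic data.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are invented for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Embarked": ["C", "S", "Q", "S"],
})

# One-hot encode Embarked: one 0/1 column per port instead of a single 0/1/2 column
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(encoded.columns.tolist())
# ['Pclass', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
```

If you take this route, the feature list would need to name the new `Embarked_*` columns instead of `Embarked`.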

🔀 Step 3: Split the Data (Training/Test)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
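
When the classes are imbalanced, a plain random split can leave the test set with a different survival rate than the training set. One optional variation is to pass `stratify` to `train_test_split`. The sketch below uses synthetic data (100 rows, 40% positive) rather than the Titanic file, and the names `X_demo` and `y_demo` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with 40% positive labels (not the real Titanic data)
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([1] * 40 + [0] * 60)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))      # sizes of the two splits
print(y_tr.mean(), y_te.mean())  # share of positive labels in each split
```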

🤖 Step 4: Train the Model and Predict

# Create and train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

📊 Step 5: Evaluate the Model

Print the accuracy and the classification report

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Visualize the confusion matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
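
To see how the heatmap relates to the numbers in the classification report, the following sketch unpacks a binary confusion matrix and computes accuracy, precision, and recall by hand. The label lists are invented for illustration and are not output from the Titanic model.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = survived, 0 = did not survive)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted survivors, how many actually survived
recall = tp / (tp + fn)     # of the actual survivors, how many were found

print(tn, fp, fn, tp)       # 4 1 1 4
print(accuracy, precision, recall)
```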

💡 Practical tip: Besides Random Forest, you can try other algorithms such as Logistic Regression and SVM.
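
One possible way to compare several algorithms is sketched below with 5-fold cross-validation. It runs on synthetic data from `make_classification`, not the Titanic set, so the scores say nothing about the real task. Scaling the inputs for Logistic Regression and SVM is a design choice made here, not something the original post prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with 7 features (stand-in for X, y)
X_syn, y_syn = make_classification(n_samples=300, n_features=7, random_state=42)

# Logistic Regression and SVM are sensitive to feature scale, so they get a scaler
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy for each candidate
scores = {
    name: cross_val_score(estimator, X_syn, y_syn, cv=5).mean()
    for name, estimator in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```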

🚀 Step 6: Visualize Feature Importance

# Get the feature importances
importances = model.feature_importances_
feature_names = X.columns

# Plot them
sns.barplot(x=importances, y=feature_names)
plt.title('Feature Importances')
plt.show()
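
The bar chart is easier to read when the features are sorted. A minimal sketch follows, assuming a fitted Random Forest. It trains on synthetic data with placeholder column names `f0` to `f6`, so the ranking itself is meaningless for Titanic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names (not the Titanic features)
names = [f"f{i}" for i in range(7)]
X_arr, y_arr = make_classification(n_samples=200, n_features=7, random_state=0)
X_df = pd.DataFrame(X_arr, columns=names)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_df, y_arr)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(rf.feature_importances_, index=X_df.columns).sort_values(ascending=False)
print(ranked)
```

Passing `x=ranked.values, y=ranked.index` to `sns.barplot` would then draw the bars in ranked order.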

📌 Conclusion and Next Steps

Check the model's accuracy and its most important variables
Additional feature engineering can improve the model
Apply hyperparameter tuning (GridSearchCV, etc.) to improve performance
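
As a rough sketch of the hyperparameter tuning mentioned above, `GridSearchCV` can try a small grid of Random Forest settings. The grid values here are arbitrary illustrations rather than recommended settings, and the data is synthetic rather than the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples=200, n_features=7, random_state=42)

# Small illustrative grid: 2 x 2 = 4 combinations, each evaluated with 3-fold CV
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

print(search.best_params_)  # best combination found in the grid
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` is a model refit on all the data passed to `fit`, so it can be used for prediction directly.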

🎯 Next step: Deploy this model as a web application to build a prediction service driven by real user input.
